(54) Voice message processing system and method



(57) A voice message is processed in a distributed 
system by storing voice message data indicative of a 
plurality of voice messages on a distributed data store. 
A distributed data processor accesses the voice messages
and extracts desired information from the voice
messages. The data processor then augments the data
stored in the voice message data store with the extract- 
ed information. The user interface component provides 
user access to the voice messages with the augmented 
data. 
Description

BACKGROUND OF THE INVENTION 

[0001] The present invention relates to speech 
processing. More specifically, the present invention re- 
lates to voice message processing for processing voice 
messages received by a distributed system. 
[0002] Currently, many people receive a large
number of different types of messages from a wide va- 
riety of sources. For example, it is not uncommon for 
persons to receive tens of voice mail messages over a 
weekend. Exacerbating this problem is the recent use 
of unified messaging. In a unified messaging system, 
messages from a wide variety of sources, such as voice 
messages, electronic mail messages, fax messages, 
and instant messages, can be accessed in a seamlessly 
united manner. However, compared to electronic mail
messages and instant messaging systems, the type of 
information associated with voice messages is very lim- 
ited. 

[0003] For example, an electronic mail message typically
includes the identity of the sender, a subject line,
and an indication as to priority. Similarly, such messages
can be fairly easily scanned, copied and pasted, since
they are textual in nature. By contrast, voice mail messages
typically do not have any indication of sender. In
systems equipped with caller identification, the incoming
number can be identified and a presumed sender
can also be identified, if the incoming number is associated
with a person. However, such systems only track
a telephone, and not a speaker. Voice mail messages
typically do not include an indication as to subject or pri- 
ority, and are also difficult to scan, copy and paste, since 
they are vocal in nature, rather than written. 
[0004] The lack of information associated with voice 
messages makes them more time consuming to process.
For example, it is possible to eliminate many electronic 
mail messages simply by skimming the subject line or 
the sender line, and deleting them immediately from the 
mail box if they are not desired, or organizing them into 
a desired folder. In fact, this can even be done automat- 
ically by specifying rules for deleting mail messages 
from certain users or having certain subjects. 
[0005] Scanning voice mail messages, on the other 
hand, typically requires a much greater amount of time, 
because the user must listen to each message simply 
to extract the basic information such as the sender and 
subject. It is also virtually impossible, currently, to auto- 
matically create rules to pre-organize voice mail mes- 
sages (such as to organize them by sender, subject or 
urgency). 

SUMMARY OF THE INVENTION 

[0006] A voice message is processed in a distributed 
system by storing voice message data indicative of a 
plurality of voice messages on a distributed data store.
A distributed data processor accesses the voice messages
and extracts desired information from the voice
messages. The data processor then augments the data
stored in the voice message data store with the extracted
information. The user interface component provides
user access to the voice messages with the augmented 
data. 

[0007] In one embodiment, the distributed voice data 
processor applies user selected rules to the data, such
as sorting, generating alerts and alarms.

[0008] The voice data processor illustratively extracts 
a wide variety of information, such as speaker identity
(using speaker identification models), speaker emotion,
and speaking rate. The voice data processor can also
normalize the messages to a desired speaking rate,
selectable by the user.

[0009] In one embodiment, the voice data processor
also includes a transcription component for transcribing
and summarizing the messages, and performing some
natural language processing (such as semantic parsing)
on the voice messages.

[0010] The user input can provide the user with a wide 
range of user actuable inputs for manipulating the voice
messages. Such inputs can include, for example, a rate
changing input for speeding up or slowing down the
voice messages, inputs to set rules, displays of the various
information extracted from the voice message, and
display of rules which have been selected or deselected 
by the user. 


BRIEF DESCRIPTION OF THE DRAWINGS 
[0011] 

FIG. 1 is a block diagram of one illustrative environment
in which the present invention can be used.
FIG. 2 is a more detailed block diagram showing a
system in accordance with the present invention.
FIG. 3 is a flow diagram generally illustrating the operation
of the system shown in FIG. 2.
FIG. 4 is a more detailed block diagram of a voice
data processing system in accordance with one embodiment
of the present invention.
FIG. 5 is an illustration of one exemplary embodiment
of a user interface in accordance with the
present invention.

DETAILED DESCRIPTION OF ILLUSTRATIVE 
EMBODIMENTS 


[0012] The present invention is implemented on a distributed
processing system to extract desired information
from voice messages. The present invention extracts
the desired information and augments a voice data
store containing the voice messages with the extracted
information. A user interface is provided such that
the voice messages can be easily manipulated given
the augmented information that has been added to
them.

By distributed, the present description means a non- 
server based system, but a system under the control of 
the individual user, such as a desk top system, a per- 
sonal digital assistant (PDA), a telephone, a laptop com- 
puter, etc. Therefore, when the present description dis- 
cusses a distributed processor, for instance, the present 
description means a processor residing on a device 
which may be part of a network but which is under the 
personal control of the user, rather than on a server, for 
example. 

[0013] FIG. 1 illustrates an example of a suitable computing
system environment 100 on which the invention
may be implemented. The computing system environment
100 is only one example of a suitable computing
environment and is not intended to suggest any limitation
as to the scope of use or functionality of the invention.
Neither should the computing environment 100 be
interpreted as having any dependency or requirement
relating to any one or combination of components illustrated
in the exemplary operating environment 100.
[0014] The invention is operational with numerous 
other general purpose or special purpose computing 
system environments or configurations. Examples of 
well known computing systems, environments, and/or 
configurations that may be suitable for use with the invention
include, but are not limited to, personal computers,
hand-held or laptop devices, multiprocessor systems,
microprocessor-based systems, set top boxes,
programmable consumer electronics, network PCs, 
minicomputers, mainframe computers, distributed com- 
puting environments that include any of the above sys- 
tems or devices, and the like. 

[0015] The invention may be described in the general
context of computer-executable instructions, such as 
program modules, being executed by a computer. Gen- 
erally, program modules include routines, programs, ob- 
jects, components, data structures, etc. that perform 
particular tasks or implement particular abstract data 
types. The invention may also be practiced in distributed 
computing environments where tasks are performed by 
remote processing devices that are linked through a 
communications network. In a distributed computing en- 
vironment, program modules may be located in both lo- 
cal and remote computer storage media including mem- 
ory storage devices. 

[0016] With reference to FIG. 1, an exemplary system
for implementing the invention includes a general purpose
computing device in the form of a computer 110.
Components of computer 110 may include, but are not
limited to, a processing unit 120, a system memory 130,
and a system bus 121 that couples various system components
including the system memory to the processing
unit 120. The system bus 121 may be any of several
types of bus structures including a memory bus or memory
controller, a peripheral bus, and a local bus using
any of a variety of bus architectures. By way of example,
and not limitation, such architectures include Industry
Standard Architecture (ISA) bus, Micro Channel Architecture
(MCA) bus, Enhanced ISA (EISA) bus, Video
Electronics Standards Association (VESA) local bus,
and Peripheral Component Interconnect (PCI) bus also
known as Mezzanine bus.

[0017] Computer 110 typically includes a variety of 
computer readable media. Computer readable media 
can be any available media that can be accessed by 
computer 110 and includes both volatile and nonvolatile
media, removable and non-removable media. By way
of example, and not limitation, computer readable media
may comprise computer storage media and communication
media. Computer storage media includes both
volatile and nonvolatile, removable and non-removable
media implemented in any method or technology for
storage of information such as computer readable instructions,
data structures, program modules or other
data. Computer storage media includes, but is not limited
to, RAM, ROM, EEPROM, flash memory or other
memory technology, CD-ROM, digital versatile disks
(DVD) or other optical disk storage, magnetic cassettes,
magnetic tape, magnetic disk storage or other magnetic
storage devices, or any other medium which can be
used to store the desired information and which can be
accessed by computer 110. Communication media typically
embodies computer readable instructions, data
structures, program modules or other data in a modulated
data signal such as a carrier wave or other transport
mechanism and includes any information delivery
media. The term "modulated data signal" means a signal
that has one or more of its characteristics set or
changed in such a manner as to encode information in
the signal. By way of example, and not limitation, communication
media includes wired media such as a wired
network or direct-wired connection, and wireless media
such as acoustic, RF, infrared and other wireless media.
Combinations of any of the above should also be includ- 
ed within the scope of computer readable media. 
[0018] The system memory 130 includes computer
storage media in the form of volatile and/or nonvolatile
memory such as read only memory (ROM) 131 and random
access memory (RAM) 132. A basic input/output
system 133 (BIOS), containing the basic routines that
help to transfer information between elements within
computer 110, such as during start-up, is typically stored
in ROM 131. RAM 132 typically contains data and/or
program modules that are immediately accessible to
and/or presently being operated on by processing unit
120. By way of example, and not limitation, FIG. 1 illustrates
operating system 134, application programs 135,
other program modules 136, and program data 137.
[0019] The computer 110 may also include other
removable/non-removable volatile/nonvolatile computer
storage media. By way of example only, FIG. 1 illustrates
a hard disk drive 141 that reads from or writes to
non-removable, nonvolatile magnetic media, a magnetic
disk drive 151 that reads from or writes to a removable,
nonvolatile magnetic disk 152, and an optical disk
drive 155 that reads from or writes to a removable,
nonvolatile optical disk 156 such as a CD ROM or other
optical media. Other removable/non-removable,
volatile/nonvolatile computer storage media that can be used in
the exemplary operating environment include, but are
not limited to, magnetic tape cassettes, flash memory
cards, digital versatile disks, digital video tape, solid
state RAM, solid state ROM, and the like. The hard disk
drive 141 is typically connected to the system bus 121
through a non-removable memory interface such as
interface 140, and magnetic disk drive 151 and optical
disk drive 155 are typically connected to the system bus
121 by a removable memory interface, such as interface
150.

[0020] The drives and their associated computer stor- 
age media discussed above and illustrated in FIG. 1, 
provide storage of computer readable instructions, data 
structures, program modules and other data for the 
computer 110. In FIG. 1, for example, hard disk drive
141 is illustrated as storing operating system 144, application
programs 145, other program modules 146,
and program data 147. Note that these components can
either be the same as or different from operating system 
134, application programs 135, other program modules 
136, and program data 137. Operating system 144, ap- 
plication programs 145, other program modules 146, 
and program data 147 are given different numbers here 
to illustrate that, at a minimum, they are different copies. 
[0021] A user may enter commands and information 
into the computer 110 through input devices such as a 
keyboard 162, a microphone 163, and a pointing device
161, such as a mouse, trackball or touch pad. Other input
devices (not shown) may include a joystick, game
pad, satellite dish, scanner, or the like. These and other 
input devices are often connected to the processing unit 
120 through a user input interface 160 that is coupled 
to the system bus, but may be connected by other inter- 
face and bus structures, such as a parallel port, game 
port or a universal serial bus (USB). A monitor 191 or 
other type of display device is also connected to the sys- 
tem bus 121 via an interface, such as a video interface 
190. In addition to the monitor, computers may also in- 
clude other peripheral output devices such as speakers 
197 and printer 196, which may be connected through 
an output peripheral interface 195.
[0022] The computer 110 may operate in a networked
environment using logical connections to one or more
remote computers, such as a remote computer 180. The
remote computer 180 may be a personal computer, a
hand-held device, a server, a router, a network PC, a
peer device or other common network node, and typically
includes many or all of the elements described
above relative to the computer 110. The logical connections
depicted in FIG. 1 include a local area network
(LAN) 171 and a wide area network (WAN) 173, but may
also include other networks. Such networking environments
are commonplace in offices, enterprise-wide
computer networks, intranets and the Internet.



[0023] When used in a LAN networking environment,
the computer 110 is connected to the LAN 171 through
a network interface or adapter 170. When used in a
WAN networking environment, the computer 110 typically
includes a modem 172 or other means for establishing
communications over the WAN 173, such as the
Internet. The modem 172, which may be internal or external,
may be connected to the system bus 121 via the
user input interface 160, or other appropriate mechanism.
In a networked environment, program modules
depicted relative to the computer 110, or portions thereof,
may be stored in the remote memory storage device.
By way of example, and not limitation, FIG. 1 illustrates
remote application programs 185 as residing on remote
computer 180. It will be appreciated that the network
connections shown are exemplary and other means of
establishing a communications link between the computers
may be used.

[0024] FIG. 2 is a more detailed block diagram of a
voice message processing system 200 in accordance
with one embodiment of the present invention. System
200 illustratively includes voice data input component
202, voice data store 204, user interface component
206, and voice data processor 208. Voice data input
component 202 may illustratively include a telephone in
cases where the voice data includes voice mail messages,
a microphone where the voice data is recorded lectures
or conversations, for example, and it can be other
components, such as a radio, a compact disc player, etc.

[0025] Voice data store 204 is illustratively a portion
of memory which stores the voice data, such as WAV
files. User interface component 206 illustratively generates
a user interface that can be invoked by the user to
manipulate and organize the voice messages stored in
voice data store 204. Voice data processor 208 illustratively
includes information extraction component 210
that extracts useful information from the voice messages
and rule application component 212 that applies
user-selected rules to the voice messages.

[0026] FIG. 3 is a flow diagram that illustrates the general
operation of system 200. Voice messages are first
received from data input component 202 and stored in
voice data store 204. This is indicated by block 214 in
FIG. 3. Information extraction component 210 periodically,
or intermittently, accesses data store 204 to determine
whether any new voice messages have been
stored in data store 204 since the last time it was accessed
by information extraction component 210. This
is indicated by blocks 216 and 218 in FIG. 3. If no new
messages have been stored in voice data store 204
since the last time it was accessed by information extraction
component 210, then processing simply reverts
to block 216.

[0027] However, if, at block 218, information extraction
component 210 comes upon new voice messages
which have not been processed, then it subjects those
new messages to voice data processing and extracts
desired information from the new messages. This is indicated
by block 220. Some examples of the desired information
will be discussed in greater detail below, but
it may illustratively be suited to enhance organization
and manipulation of voice messages in data store 204
and to enhance application of rules to those messages.
[0028] In any case, once the desired information has 
been extracted from the new messages, the information 
(corresponding to the new messages) which is stored 
in voice data store 204 is augmented with the additional
information which has just been extracted by information
extraction component 210. This is indicated by
block 222 in FIG. 3.
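
The flow of blocks 216 through 222 can be sketched in a few lines of Python. This is only an illustrative sketch, not the claimed implementation: the `VoiceMessage` record, its `processed` flag, and the `extract_information` placeholder are hypothetical names introduced here, and the data store is assumed to be a simple in-memory list.

```python
from dataclasses import dataclass, field


@dataclass
class VoiceMessage:
    """One entry in the voice data store (hypothetical layout)."""
    message_id: str
    audio_path: str                                 # e.g. path to a WAV file
    extracted: dict = field(default_factory=dict)   # augmented information
    processed: bool = False


def extract_information(message):
    """Placeholder for block 220; a real extractor would analyze the audio."""
    return {"speaker_id": "unknown"}


def process_new_messages(store):
    """One pass over blocks 216-222: find new messages, extract, augment."""
    new_messages = [m for m in store if not m.processed]   # blocks 216 and 218
    for message in new_messages:
        info = extract_information(message)                # block 220
        message.extracted.update(info)                     # block 222
        message.processed = True
    return len(new_messages)
```

Calling `process_new_messages` periodically or intermittently corresponds to the polling described above; a pass that finds nothing new returns 0 and simply waits for the next call.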

[0029] The type of information extracted from the 
voice mail messages can vary widely, as discussed 
above, but a number of types of information which can
be extracted to enable a user to more efficiently process 
voice messages include the speaker's identity, the 
speaker's speaking rate, the speaker's emotional state, 
the content of the message, etc. FIG. 4 is a block dia- 
gram which illustrates one embodiment of information 
extraction component 210 for extracting these types of 
information. Of course, other information or different
information can be extracted as well.
[0030] FIG. 4 illustrates that information extraction 
component 210 illustratively includes model training 
component 300, speaker identification component 302, 
speaker identification models 304, acoustic feature extraction
component 306, emotion identifier 308, rate normalization
component 310, speech-to-text component
312 and natural language processor 314. In one embodiment,
the new message voice data 316 is obtained from
voice data store 204. Data 316 is illustratively a WAV
file, or other file, that represents a new voice message
stored in data store 204, which has not yet been processed
by information extraction component 210.
[0031] In one embodiment, data 316 is provided to 
speaker identification component 302. Component 302 
accesses speaker models 304 and generates a speaker 
identification output (speaker ID) 320 indicative of an 
identity of the speaker. Speaker identification compo- 
nent 302 and models 304 can illustratively be any known 
speaker identification component and speaker identifi- 
cation models trained on specific speakers. Speaker 
identification output 320 can be a textual name of a 
speaker, an encoded identifier, or any identifier assigned
by a user.

[0032] In the event that component 302 cannot identify
a speaker (for example, if models 304 do not contain
a model associated with the speaker of the new message),
component 302 illustratively provides speaker
identification output 320 indicating that the identity of the
speaker is unknown. In that instance, when the user reviews
the new message and the speaker ID 320 is displayed
as unknown, the user can illustratively actuate a
user input on the user interface (discussed in greater
detail below with respect to FIG. 5). This causes model
training component 300 to obtain the WAV file (or other
voice data) associated with the new message. Model
training component 300 then trains a speaker identification
model corresponding to this speaker and associates
it with a speaker identification input by the user, or
with a default speaker identification. Thus, the next time
a voice message is processed from that speaker, speaker
identification component 302 produces the accurate
speaker ID 320 because it has a speaker identification
model 304 associated with the speaker.
[0033] Model training component 300 can also refine
models where the speaker identification component 302
has made a mistake. If the system makes a mistake, the
user illustratively types the correct name in a window on
a user interface and enters a user input command commanding
model training component 300 to automatically
train up a new speaker model 304 for that particular
speaker. The user can also choose to update the models
during use so that speaker identification becomes more
accurate in the future, the more the system is used. Conversely,
training component 300 can incrementally update
speaker models 304 in an unsupervised fashion.
For example, if the user accesses the new voice message,
which displays the speaker identity, and the user
does not change the speaker identity, then model training
component 300 can access the voice data associated
with that message and refine its model corresponding
to that speaker.

[0034] Speaker identification component 302 can also
provide, along with speaker ID 320, a confidence
score indicating how confident it is with the recognized
identity. Based on a user's confirmation of the system's
decision, speaker identification component 302 can automatically
update its parameters to improve performance
over time.
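
The interplay between identification, the "unknown" fallback, the confidence score, and model training can be illustrated with a toy Python sketch. Everything here is an assumption made for illustration: each speaker model is reduced to a single reference feature vector, similarity is plain cosine similarity, the 0.8 threshold is arbitrary, and training is simple averaging. A practical system would use a known speaker identification technique instead.

```python
import math

UNKNOWN = "unknown"


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def identify_speaker(features, models, threshold=0.8):
    """Return (speaker_id, confidence); 'unknown' if no model scores high enough."""
    best_id, best_score = UNKNOWN, 0.0
    for speaker_id, reference in models.items():
        score = cosine_similarity(features, reference)
        if score > best_score:
            best_id, best_score = speaker_id, score
    if best_score < threshold:
        return UNKNOWN, best_score
    return best_id, best_score


def train_speaker_model(models, speaker_id, features):
    """Toy stand-in for model training: create a model or average in new data."""
    if speaker_id in models:
        old = models[speaker_id]
        models[speaker_id] = [(o + f) / 2 for o, f in zip(old, features)]
    else:
        models[speaker_id] = list(features)
```

With an empty model set every message comes back as unknown; after the user supplies a name and `train_speaker_model` is called, a later message with similar features is attributed to that speaker together with its confidence score.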

[0035] In accordance with another embodiment of the
present invention, information extraction component
210 includes the acoustic feature extraction component
306 for extracting desired acoustic information from
voice data 316 to generate other data helpful to the user
in manipulating the voice messages. For example, by
extracting certain acoustic features, emotion identifier
308 can identify a predicted emotion of the speaker and
output speaker emotion ID 322 indicative of that emotion.

[0036] Emotion identifier 308 can be any known emotion
identifier, and can also be that described in the paper
entitled EMOTION DETECTION FROM SPEECH
TO ENRICH MULTIMEDIA CONTENT, by F. Yu et al.,
2001. The system classifies emotions into general categories,
such as anger, fear, and stress. By using such
information, the system can easily classify the urgency
of the message based on the sender and the emotional
state of the sender.

EP 1 345 394 A1

[0037] In one illustrative embodiment, acoustic feature
extraction component 306 extracts the pitch of the
incoming speech and uses a plurality of derivatives of
the pitch signal as basic features. Those features are
then input into a support vector machine in emotion
identifier 308 which categorizes each sentence as happy,
sad, or angry. The support vector machines are
each, illustratively, binary classifiers. Therefore, emotion
identifier 308 can decide that multiple emotions exist
in each sentence, with varying weights. This corresponds
to the fact that multiple emotions can exist in a
single sentence. Thus, speaker emotion identification
output 322 can display all of those emotions, with corresponding
weights, or it can simply display the strongest
emotion, or any other combination of emotions.
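
As a rough illustration of how one binary classifier per emotion can yield several weighted emotions for a sentence, consider the Python sketch below. It is a simplification under stated assumptions: the two pitch features are invented for the example, and each classifier is a plain linear decision function (weights and bias) standing in for a trained binary support vector machine, with hypothetical parameter values supplied by the caller.

```python
def pitch_derivative(pitch):
    """First difference of a pitch contour (Hz per frame)."""
    return [b - a for a, b in zip(pitch, pitch[1:])]


def pitch_features(pitch):
    """Toy feature vector: mean absolute slope and pitch range.

    Assumes the contour holds at least two frames.
    """
    deltas = pitch_derivative(pitch)
    mean_abs_slope = sum(abs(d) for d in deltas) / len(deltas)
    return [mean_abs_slope, max(pitch) - min(pitch)]


def emotion_weights(features, classifiers):
    """Run one binary classifier per emotion and normalize the positive scores.

    classifiers maps an emotion name to (weights, bias) of a linear decision
    function, a stand-in for a trained binary support vector machine.
    """
    scores = {}
    for emotion, (weights, bias) in classifiers.items():
        score = sum(w * x for w, x in zip(weights, features)) + bias
        if score > 0:                      # this classifier says "present"
            scores[emotion] = score
    total = sum(scores.values())
    return {e: s / total for e, s in scores.items()} if total else {}


def strongest_emotion(weights):
    """Return the emotion with the largest weight, or None if none fired."""
    return max(weights, key=weights.get) if weights else None
```

The dictionary returned by `emotion_weights` corresponds to displaying all detected emotions with their weights, while `strongest_emotion` corresponds to displaying only the strongest one.
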
[0038] In one embodiment, acoustic feature extrac- 
tion component 306 also illustratively extracts a speak- 
ing rate of the message. This can be done using a 
number of different approaches. For example, acoustic 
feature extraction component 306 can take a cepstral
measurement to determine how fast the cepstral pattern
associated with the new voice message is changing.
This provides an indication as to the rate of speech
(in, for example, words per minute) for the new voice 
message. 

[0039] In one embodiment, rate normalization component
310 is used. In accordance with that embodiment,
the user can input a desired speaking rate (or can
choose one from a pre-set list). Rate normalization component
310 then receives the speaking rate associated
with the new voice message from acoustic feature ex- 
traction component 306 and normalizes the speaking 
rate for that message to the normalized rate selected by 
the user. Rate normalization component 310 then out- 
puts a rate-normalized speech data file (e.g., a WAV file) 
normalized to the desired rate, as indicated by block 
324. That file 324 is illustratively used at the user inter- 
face such that the voice message is spoken at the nor- 
malized rate when the user accesses the new message. 
Of course, the system can also retain the original mes- 
sage as well. 

[0040] In one illustrative embodiment, in order to nor- 
malize the speaking rate, rate normalization component 
310 evaluates the speaking rate of the new voice mes- 
sage and adjusts the speaking rate of each sentence 
with a known time scale modification algorithm. The sys- 
tem can also reduce the length of silence and pause in- 
tervals within the waveform for more efficient listening. 
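
The arithmetic behind rate normalization can be sketched as follows. This Python fragment only plans the new segment durations; it does not perform the time scale modification of the audio itself, which would be done by a known algorithm. The segment representation, the function names, and the 0.3 second pause cap are assumptions chosen for illustration.

```python
def playback_speedup(measured_wpm, desired_wpm):
    """Factor by which speech must be sped up (>1) or slowed down (<1)."""
    if measured_wpm <= 0 or desired_wpm <= 0:
        raise ValueError("speaking rates must be positive")
    return desired_wpm / measured_wpm


def plan_normalization(segments, measured_wpm, desired_wpm, max_pause=0.3):
    """Compute target durations for each segment of a message.

    segments is a list of (kind, seconds) pairs, where kind is "speech"
    or "silence". Speech is rescaled to the desired rate; silences are
    capped at max_pause seconds for more efficient listening.
    """
    speedup = playback_speedup(measured_wpm, desired_wpm)
    planned = []
    for kind, seconds in segments:
        if kind == "silence":
            planned.append((kind, min(seconds, max_pause)))
        else:
            planned.append((kind, seconds / speedup))
    return planned
```

For example, a message measured at 100 words per minute and normalized to 200 words per minute has each speech segment halved in duration, while a one second pause is shortened to the cap.
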
[0041] In accordance with another embodiment of the 
present invention, information extraction component 
210 also includes a speech-to-text component 312.
Component 312 illustratively includes a speech recognizer
which reduces the voice data corresponding to the
new message to a textual transcription that can be pro- 
vided to optional natural language processor 314. Of 
course, speech-to-text component 312 can simply out- 
put the message transcription 330, which corresponds 
to the entire transcription of the new voice message in- 
dicated by data 316. However, where natural language 
processor 314 is provided, natural language processing 
can be applied to the transcription as well. 
[0042] In one embodiment, natural language processor
314 includes summarization component 332 and semantic
parser 334. Summarization component 332 is illustratively
a known processing subsystem for summarizing
a textual input. Summarization component 332
thus outputs a message summary 336 which corresponds
to a short summary of the voice message.

[0043] In an embodiment in which semantic parser
334 is provided, the textual transcription generated by
speech-to-text component 312 is illustratively input to
semantic parser 334. Parser 334 then generates a semantic
parse of the textual input to assign semantic labels
to certain portions of the textual input and provide
a semantic parse tree 338 at its output. One example of
a semantic parse tree is an output that assigns semantic
labels to portions of the voice message wherein the semantic
labels correspond to various application schema
implemented by the computing system on which the
voice message resides, such that the voice message
can be more readily adapted to that schema.
[0044] Once information extraction component 210
has generated all of these outputs, rule application component
212 (shown in FIG. 2) can execute user designated
rules based on the voice data 316 and the extracted
information (320, 322, 324, 330, 336 and 338) in order
to enhance organization of the voice messages. For
example, the user may select a rule that causes rule application
component 212 to sort the voice messages by
speaker, to filter them into different directories, to sort
or filter the messages based on a subject (such as the
message summary 336) or to sort by date. Rule application
component 212 can also be employed to apply
other rules, such as to alert the user based on certain
attributes of the message, such as the speaker emotion
322, the speaker identity 320, or the message content
(from message transcription 330, message summary
336 or semantic parse 338). Rule application component
212 can also be configured to delete messages
from certain people or after a certain amount of time has
elapsed since the message has been received. Rule application
component 212 can also generate alarms
based on predetermined criteria, such as the number of
messages stored, the speaker identity 320, speaker
emotion 322, etc. Of course, a wide variety of other rules
can be applied by rule application component 212 as
well.
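
A few of the rules mentioned above (sorting by speaker, alerting on emotion or speaker identity, and deleting by sender or age) could be expressed along the following lines. This Python sketch is illustrative only; the `MessageInfo` record and the rule function names are hypothetical and are not taken from the described system.

```python
from dataclasses import dataclass


@dataclass
class MessageInfo:
    """Extracted information for one message (hypothetical record)."""
    speaker_id: str
    emotion: str
    age_days: float
    summary: str = ""


def sort_by_speaker(messages):
    """Sorting rule: order messages alphabetically by speaker identity."""
    return sorted(messages, key=lambda m: m.speaker_id)


def alerts(messages, urgent_emotions=("angry",), vip_speakers=()):
    """Alert rule: select messages with an urgent emotion or a chosen speaker."""
    return [m for m in messages
            if m.emotion in urgent_emotions or m.speaker_id in vip_speakers]


def purge(messages, blocked_speakers=(), max_age_days=None):
    """Deletion rule: drop messages from blocked speakers or older than a limit."""
    kept = []
    for m in messages:
        if m.speaker_id in blocked_speakers:
            continue
        if max_age_days is not None and m.age_days > max_age_days:
            continue
        kept.append(m)
    return kept
```

Each rule takes the list of messages and returns a new list, so user-selected rules can be applied in any order or combined without modifying the underlying store.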

[0045] FIG. 5 is an illustration of one embodiment of
a user interface in accordance with one example of the
present invention. It will of course be appreciated that a
wide variety of other user interfaces can be used, or the
user interface can contain the same information as that
shown in FIG. 5, but can be configured differently. FIG.
5 illustrates a user interface 400, which includes a display
portion 402 and a tool bar portion 404. Display portion
402 is shown generating a display generally indicative
of the WAV file 403, or acoustic representation of
the voice message currently selected. Display portion
402 is also shown displaying the textual transcription
405, and could also show a textual summary or a combination
of any of those or other items of information.
Display portion 402 also illustratively includes a display
portion 406 that displays the caller identity and day and
time of the call along with the caller's telephone number.

[0046] Tool bar portion 404 also illustratively includes
a variety of user actuable inputs which the user can
actuate to manipulate or organize the voice messages.
The inputs shown in FIG. 5 include, as examples, a
delete input 408 for deleting the message, start and stop
buttons 410 and 412, respectively, for starting and
stopping a playback of the voice message. FIG. 5 also shows
a faster/slower wiper 416 which allows the user to speed
up or slow down the rate at which the voice message is
played. Interface 400 in FIG. 5 can also include other
user actuable inputs such as File and Print actuators
used to store and print messages, and Get Message
and New Message actuators used to retrieve old or new
messages. Interface 400 also illustratively includes an
autorate selector 418 which causes the message to be
automatically normalized to a desired rate. Further,
interface 400 illustratively includes emotion display 420
that displays the sensed emotion. Of course, the user
interface can contain a wide variety of other user
actuable inputs which allow the user to configure the user
interface to display text, the acoustic information, the
augmented information, and apply different rules etc.

[0047] It can thus be seen that the present invention
provides a distributed processor for extracting desired
information and augmenting a voice message data store
with the desired information. The desired information
illustratively is of a nature that helps the user to organize,
sort and review or process voice messages.

[0048] Although the present invention has been
described with reference to particular embodiments,
workers skilled in the art will recognize that changes may be
made in form and detail without departing from the spirit
and scope of the invention.

Claims

1. A voice message processing system, comprising:

a distributed voice message (VM) data store
storing voice message data indicative of a
plurality of voice messages;

a distributed voice data processor, coupled to
the VM data store, configured to access the
voice messages, extract desired information
from the voice messages and augment the VM
data stored in the VM data store with the
desired information; and

a user interface component coupled to the VM
data store and configured to provide user
access to the augmented VM data.

2. The system of claim 1 wherein the distributed voice
data processor comprises:

a rule application component configured to
receive user rule inputs indicative of user-selected
rules and to apply the user-selected rules to
the augmented VM data.

3. The system of claim 2 wherein the distributed voice
data processor comprises:

a speaker identification model data store
storing at least one speaker identification model;
and

a speaker identification component configured
to access the speaker identification model data
store and provide an indication of an identity of
a speaker associated with the voice message
corresponding to the VM data.

4. The system of claim 3 wherein the distributed voice
data processor comprises:

a speaker model training component configured
to receive VM data and train a speaker
identification model based on the VM data and
a user input indicative of a speaker of a voice
message corresponding to the VM data.

5. The system of claim 2 wherein the distributed voice
data processor comprises:

an acoustic feature extractor extracting acoustic
features from the VM data, the acoustic features
being indicative of the desired information.

6. The system of claim 4 wherein the acoustic feature
extractor is configured to extract features indicative
of a speaker emotion and provide an emotion output
indicative of the speaker's emotion.

7. The system of claim 4 wherein the acoustic feature
extractor is configured to extract features indicative
of a speaking rate and provide a rate output
indicative of the speaking rate.

8. The system of claim 7 wherein the distributed voice
data processor comprises:

a rate normalization component configured to
receive the rate output and normalize an
associated voice message to a preselected
speaking rate.

9. The system of claim 2 wherein the distributed voice
data processor comprises:

a speech-to-text component configured to
generate a textual output indicative of a content of
a voice message.

10. The system of claim 9 wherein the speech-to-text
component is configured to generate a transcription
of the voice message as the textual output.

11. The system of claim 9 wherein the distributed voice 
data processor comprises: 

a summarization component configured to gen- 
erate a summary of the voice message. 

12. The system of claim 9 wherein the distributed voice 
data processor comprises: 

a semantic parser configured to generate a semantic
parse of at least a portion of the voice message.

13. The system of claim 2 wherein the rule application 
component sorts voice messages based on the de- 
sired information. 

14. The system of claim 2 wherein the rule application 
component generates alarms based on the desired 
information. 

15. The system of claim 2 wherein the user interface 
component generates a user interface exposing us- 
er-selectable inputs for manipulation of the voice 
message by the user. 

16. The system of claim 15 wherein the user-selectable
inputs comprise: 

a rate changing input which, when actuated by 
a user, changes a speaking rate associated 
with voice messages. 

17. The system of claim 15 wherein the user interface 
displays a textual indication of a content of a voice 
message. 

18. The system of claim 15 wherein the user interface 
displays an identity indication indicative of an iden- 
tity of a speaker of a voice message. 

19. The system of claim 15 wherein the user interface 
displays an emotion indicator indicative of an emo- 
tion of a speaker of a voice message. 

20. The system of claim 15 wherein the user interface 
displays a rule indicator indicative of rules being ap- 
plied. 

21. A method of processing voice messages, comprising:

storing the voice messages at a distributed
voice message (VM) data store;

intermittently accessing the VM data store to
determine whether a new voice message has
been stored;

for each new voice message, processing the
new voice message at a distributed processor
to obtain extracted data including speaker identity,
acoustic features indicative of desired information,
and a textual representation of a
content of the new voice message; and

augmenting data in the VM data store with the
extracted data.

22. The method of claim 21 wherein processing the new
voice message to obtain acoustic features comprises:

obtaining acoustic features indicative of an
emotion of a speaker of the new voice message
and generating a speaker emotion output
indicative of the speaker's emotion.

23. The method of claim 21 wherein the acoustic features
include a speaking rate indicator indicative of
a speaking rate of the speaker of the new voice
message, and further comprising:

normalizing the speaking rate to a user-selected
speaking rate.

24. The method of claim 21 wherein obtaining speaker
identity includes providing an unknown output when
speaker identity is determined to be unknown and
further comprising:

receiving a user input indicative of a speaker
identity for the new voice message; and

training a speaker identification model based
on the new voice message and the user input.

25. The method of claim 21 and further comprising:

receiving a rules input indicative of user-selected
rules to be applied to the new voice message; and

applying the user-selected rules based on the
extracted data.

26. The method of claim 21 and further comprising:

semantically parsing the textual representation
of the new voice message.

27. The method of claim 21 and further comprising: 

generating a user interface to the VM data
store, the user interface including user-actua- 
ble inputs for manipulating the voice messages 
in the VM data store. 
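
For illustration only, the method recited in claim 21, together with the unknown-speaker output of claim 24, might be sketched as a single polling pass over the data store. Everything named here is an assumption made for the sketch: `VMDataStore` is an in-memory stand-in for the distributed VM data store, and the `identify_speaker`, `extract_features` and `transcribe` callables are hypothetical placeholders for the speaker identification, acoustic feature extraction and speech-to-text components. The sketch does not form part of the claims.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class VoiceMessage:
    """Hypothetical stored message: raw audio plus any extracted data."""
    message_id: str
    audio: bytes
    extracted: Dict[str, object] = field(default_factory=dict)


class VMDataStore:
    """In-memory stand-in for the distributed VM data store (assumption)."""

    def __init__(self) -> None:
        self._messages: Dict[str, VoiceMessage] = {}
        self._processed: set = set()

    def store(self, message: VoiceMessage) -> None:
        self._messages[message.message_id] = message

    def get(self, message_id: str) -> VoiceMessage:
        return self._messages[message_id]

    def new_messages(self) -> List[VoiceMessage]:
        """Messages stored since the last pass, i.e. not yet processed."""
        return [m for mid, m in self._messages.items()
                if mid not in self._processed]

    def augment(self, message_id: str, extracted: Dict[str, object]) -> None:
        """Attach the extracted data to the stored message."""
        self._messages[message_id].extracted.update(extracted)
        self._processed.add(message_id)


def poll_once(
    store: VMDataStore,
    identify_speaker: Callable[[bytes], Optional[str]],
    extract_features: Callable[[bytes], Dict[str, float]],
    transcribe: Callable[[bytes], str],
) -> List[str]:
    """Process every new message once; return ids whose speaker is unknown.

    A caller would invoke this intermittently (for example on a timer) and
    could prompt the user to label the returned ids for model training.
    """
    unknown: List[str] = []
    for message in store.new_messages():
        speaker = identify_speaker(message.audio)
        extracted = {
            "speaker_identity": speaker if speaker is not None else "unknown",
            "acoustic_features": extract_features(message.audio),
            "text": transcribe(message.audio),
        }
        store.augment(message.message_id, extracted)
        if speaker is None:
            unknown.append(message.message_id)
    return unknown
```

Passing the three components in as callables keeps the sketch independent of any particular recognizer; a second call to `poll_once` with no newly stored messages does no work, which mirrors the "for each new voice message" wording.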